Abstract

We consider the problem of decision-making with side information and
unbounded loss functions. Inspired by the probably approximately correct
learning model, we use a slightly different model that incorporates the
notion of side information in a more generic form to make it applicable to
a broader class of applications including parameter estimation and system
identification. We address sufficient conditions for consistent
decision-making with exponential convergence behavior. In this regard, besides
a certain condition on the growth function of the class of loss functions,
it suffices that the class of loss functions be dominated by a measurable
function whose exponential Orlicz expectation is uniformly bounded over
the probabilistic model. Decay exponent, decay constant, and sample
complexity are discussed. Example applications to the method of moments,
maximum likelihood estimation, and system identification are also illustrated.

1 Introduction 

Decision-making refers to a system-theoretic problem of choosing among alter- 
native options in light of their possible outcomes. Analytically, it is an optimiza- 
tion problem where the decision-maker's goal is to find optimal or suboptimal 
points, known as actions, to attain objectives that are quantified by certain utility
functions. Most often, depending on the complexity of an underlying system,
in particular in stochastic systems, a decision-maker suffers from uncertainties
in determining the outcome of an action. In such systems, a decision-maker can
use side information to reduce the risk in making decisions. This basically is
a common scenario in applications such as learning theory, hypothesis testing, 
inventory control, parameter estimation, pattern recognition, system identifica- 
tion, prediction, and filtering. Hence, analytical results and arguments within 
the generic model of decision-making under uncertainty provide important in- 
sights and tools in treating the aforementioned applications. 
In this regard, learning theory has been one of the major areas that has
been treated from a decision-making viewpoint [1]. Learnability or consistent 
decision-making is a property that is attributed to a class of loss functions that 
satisfy certain analytical requirements. In learning theory whether supervised 
or unsupervised, the true risk value for an action, known as generalization error, 
is measured as the expected loss over the observation space, where the governing 
probability distribution belongs to a known class of probability measures called 
the probabilistic model. One of the main challenges in verification of learnability
is to prove the existence of a learning algorithm or decision policy that can
use a sequence of observations and take actions whose true risk converges to the
minimal risk as the number of observations grows to infinity. In this regard,
the original model for learnability introduced by Valiant [2], known as probably
approximately correct (PAC) learning, and its decision-making based generalization
by Haussler [1] have been shown to be strongly related to the existence
of uniform laws of large numbers [3], [4].

Much work in this area addresses and discusses consistency for the case of 
bounded loss functions. The importance of bounded loss functions is that they 
allow consistency analysis over universal probabilistic models, also known as
distribution-free models. To this end, the most commonly used decision policy is
the empirical risk minimization (ERM) method [4] . It is known that the ERM 
policy is universally consistent if the collection of the level sets of the class of loss 
functions has a finite VC-dimension [4], [5]. While these results on bounded loss 
functions provide important insights on applications such as pattern recognition 
and neural networks, many other applications including regression analysis, pa- 
rameter estimation, and system identification motivate consistency analysis of 
decision-making with unbounded loss functions. Although there are some sufficient
conditions for relative uniform convergence of empirical risk functions
[4, Ch. 5], results on (absolute) uniform convergence are more insightful and
desirable in assessing the consistency of the ERM policy.

With this motivation, and inspired by [1], in this work we study the problem 
of consistent decision-making using unbounded loss functions. To state and 
formulate the problem, we use a model that is slightly different from the one 
commonly used in learning theory. This model explicitly uses generic notions of 
solution defining operator and the side information. Adopting such an approach, 
we aim not only to make the final results span a larger extent of applications, 
but also to provide a more intuitive method in articulating other applications 
into this model. 

We provide sufficient conditions for consistency of the ERM policy with 
exponential convergence behavior. These conditions are also sufficient for strong 
consistency of the ERM policy. Unlike the case of bounded loss functions, in this 
case, besides the behavior of the growth function of the class of loss functions, 
asymptotic behaviors of the probabilities of the level sets of the loss functions 
are important in determining consistency and convergence behavior of the ERM 
policy. More precisely, it is shown that to have an exponential convergence
behavior, in addition to a certain condition on the growth rate of the class of loss
functions, similar to the case of bounded loss functions, it suffices to have a
dominating function (for the class of loss functions) whose exponential Orlicz
expectation is uniformly bounded over the probabilistic model. In contrast to
the case of bounded loss functions, the decay exponent is strictly smaller than 
one. On the other hand, the expression of decay constant is very similar to its 
counterpart in the bounded case. Sample complexity and the effect of different 
parameters on it are also addressed. Furthermore, example applications to 
method of moments, maximum likelihood estimation, and system identification 
are illustrated. 

2 Analytical Setup 

Let $\mathcal{H}$ be a subset of a separable normed space that denotes the state space of a
given system. For example, in channel estimation for a mobile communication
system, $\mathcal{H}$ is defined as the set of all possible channel transfer functions
between the transmitter and receiver. In parameter estimation, $\mathcal{H}$ is defined as
the set of all unknown parameters of interest. Although in the provided
examples one seeks a good estimate for the unknown channel state or
the unknown parameter to minimize some error criterion, in general one seeks
a good action to attain certain objectives. Let $\mathcal{G}$ be a compact subset of a
Euclidean space called the action space, where one takes an action $g \in \mathcal{G}$ to
attain an objective, measured as expected loss.

In stochastic systems, associated with any state and any action, there may
exist a range of possible outcomes whose likelihood is determined by the
probabilistic model that governs all uncertain underlying parameters affecting the
outcome. For example, in a linear additive noise channel, the additive noise is
the uncertain parameter that affects the output of the channel, i.e., the
outcome. Let $\mathcal{U}$ be a subset of a separable normed space that models the range of
such parameters. We call $\mathcal{U}$ the uncertainty space.

Suppose the system is in state $h \in \mathcal{H}$ and the uncertain parameter is $u \in \mathcal{U}$.
If there exists an oracle who provides perfect information about $h$ and $u$, one
could uniquely determine the outcome for each $g \in \mathcal{G}$ and pick the action that
is most favorable. In practice, however, elements $h$ and $u$ are not given and
instead some side information about them is known. Let $\mathcal{O}$ be a subset of a
Euclidean space called the observation space. Side information is generated by
a sequence of mappings $I = (I_n)_{n \in \mathbb{N}}$ such that for each $n$,
$I_n : \mathcal{H} \times \mathcal{U}^n \to \mathcal{O}^n$, where $I_n$ is formed as a stationary extension of a function
$I : \mathcal{H} \times \mathcal{U} \to \mathcal{O}$ called the information function. For an underlying pair $(h, u^n)$, $I_n$ generates
a multi-sample $o^n = (o_i)_{i=1}^{n} \in \mathcal{O}^n$ as side information. Generation of the side
information is assumed to be governed by an independent, identically distributed
(i.i.d.) stochastic process as follows.

Let $(\Omega, \mathcal{F})$ be a measurable space, where $\Omega$ stands for a sample space and $\mathcal{F}$
denotes its Borel $\sigma$-field. Let $\mathcal{P}_\Theta = \{P_\theta : \theta \in \Theta\}$ be a collection of probability
measures defined over $(\Omega, \mathcal{F})$. For each $\theta$, let $\{U_n^{\theta}\}_{n \in \mathbb{N}}$ denote an i.i.d. stochastic
process ranged over $\mathcal{U}$ whose distribution is induced from $P_\theta$. As a result of the
information function, for each $h \in \mathcal{H}$, we obtain an i.i.d. stochastic process
$\{O_n^{\theta,h}\}$, called the side information process, where at each time instance $n$,
$O_n^{\theta,h} : \Omega \to \mathcal{O}$ is a random element whose induced probability measure is denoted
by $P_{\theta,h}$. For every Borel-measurable $E \subseteq \mathcal{O}$, $P_{\theta,h}(E)$ is described by

$$P_{\theta,h}(E) = P_\theta\big(u \in \mathcal{U} : (h,u) \in I^{-1}(E)\big). \tag{1}$$

We denote the class of all induced probability measures on $\mathcal{O}$ by $\mathcal{P}_{\Theta,\mathcal{H}}$. We
assume that for every bounded subset $E \subset \mathcal{O}$, there exists a probability measure
$P_{\theta,h} \in \mathcal{P}_{\Theta,\mathcal{H}}$ such that $P_{\theta,h}(E) < 1$. This condition ensures that the collection
$\mathcal{P}_{\Theta,\mathcal{H}}$ is not of bounded support.

Let $Q : \mathcal{O} \times \mathcal{G} \to \mathbb{R}$ be a lower bounded, lower semicontinuous loss function
that determines the penalty of taking action $g \in \mathcal{G}$ when the side information
$o \in \mathcal{O}$ is given. We denote the class of loss functions by $\mathcal{Q} = \{Q(\cdot, g)\}_{g \in \mathcal{G}}$. We
assume all the elements of $\mathcal{Q}$ are measurable and for every $g \in \mathcal{G}$, there exists
no distinct $g' \in \mathcal{G}$ such that $Q(o,g') < Q(o,g)$ for all $o \in \mathcal{O}$.

Suppose the underlying probability measure is $P_\theta$ and the system is in state
$h$. As the generation of side information is governed by $P_{\theta,h}$, the risk (expected
loss) of an action $g \in \mathcal{G}$ is determined by

$$J_\theta(h,g) = \int Q(o,g)\, dP_{\theta,h}. \tag{2}$$

The objective of a decision-maker is to minimize the risk in decision-making. 
An action with lower risk is more favorable and an action that minimizes the 
risk (upon existence) is an optimal action. Depending on the application, a 
decision-maker is also interested in actions whose risks are close to the minimal 
risk to some desired accuracy. Thus, it is of interest to investigate whether a 
decision-maker can take such actions with a desired confidence in obtained risk. 

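Definition 2.1 (Solution defining operator). Let $J_\theta(h,\cdot)$ be lower
semicontinuous over $\mathcal{G}$ for every $\theta \in \Theta$ and $h \in \mathcal{H}$. For every $h \in \mathcal{H}$, $\theta \in \Theta$, and
accuracy $\epsilon > 0$, let

$$S_\theta(h,\epsilon) = \Big\{ g \in \mathcal{G} : J_\theta(h,g) \le \inf_{g' \in \mathcal{G}} J_\theta(h,g') + \epsilon \Big\} \tag{3}$$

denote the $\epsilon$-approximate $\theta$-solution set for $h$. We call $S : \mathcal{H} \times \Theta \times \mathbb{R}^+ \to 2^{\mathcal{G}}$ a
solution defining operator.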
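
As a rough illustration of the solution set in Definition 2.1, the following sketch computes the set of near-optimal actions when the governing distribution is known and has finite support. The grid of actions, the three-point distribution, and the helper names `risk` and `approx_solution_set` are assumptions made for this example only; they do not appear in the model above.

```python
import numpy as np

def risk(loss, actions, outcomes, probs):
    """Expected loss of every candidate action under a finite-support distribution."""
    # Rows index outcomes, columns index actions.
    table = np.array([[loss(o, g) for g in actions] for o in outcomes])
    return probs @ table

def approx_solution_set(risks, actions, eps):
    """Actions whose risk is within eps of the smallest risk on the grid."""
    best = risks.min()
    return [g for g, r in zip(actions, risks) if r <= best + eps]

# Hypothetical setup: squared-error loss, three equally likely outcomes.
outcomes = np.array([0.0, 1.0, 2.0])
probs = np.array([1 / 3, 1 / 3, 1 / 3])
actions = np.linspace(0.0, 2.0, 21)   # finite grid standing in for a compact action space
risks = risk(lambda o, g: (o - g) ** 2, actions, outcomes, probs)
solution_set = approx_solution_set(risks, actions, eps=0.05)
```

Here the risk is the variance of the outcome plus the squared distance of the action from the mean, so the solution set is a small neighborhood of the mean on the grid. Replacing the infimum by a minimum over a grid is a simplification that is only reasonable when the grid is fine relative to the accuracy.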

To illustrate how Definition 2.1 can be used in practice, consider the following 
examples. 

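Example 2.1 (Minimum mean square error). Consider a linear system
with input-output relation

$$y = \langle h, x \rangle + v,$$

where $h \in \mathcal{H} \subset \mathbb{R}^m$, $x \in \mathcal{X} = \mathbb{R}^m$, $y \in \mathcal{Y} = \mathbb{R}$, $v \in \mathcal{V} = \mathbb{R}$ denote its state vector,
input vector, output value, and noise value, respectively. Let $\mathcal{H}$ be compact and
let $\mathcal{G} = \mathcal{H}$. Suppose $\mathcal{P}_\Theta$ is a tight collection of probability measures. Taking
$\mathcal{U} = \mathcal{X} \times \mathcal{V}$, $\mathcal{O} = \mathcal{X} \times \mathcal{Y}$, and $Q((x,y),g) = |y - \langle g,x \rangle|^2$, the mean square error
$\epsilon$-approximate $\theta$-solution set for $h$ is

$$S_\theta(h,\epsilon) = \Big\{ g \in \mathcal{G} : \int |y - \langle g,x \rangle|^2\, dP_{\theta,h} \le \inf_{g' \in \mathcal{G}} \int |y - \langle g',x \rangle|^2\, dP_{\theta,h} + \epsilon \Big\}. \tag{4}$$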



Example 2.2 (Maximum likelihood). Let $(\Omega, P)$ be the underlying
probability space, i.e., $\Theta$ is a singleton. Let $\mathcal{X} \subset \mathbb{R}^m$ and $\mathcal{H}$ be a compact subset of
a Euclidean space. For every $h \in \mathcal{H}$, let $X_h$ be a random variable over $\mathcal{X}$ whose
induced probability measure $P_h$ is absolutely continuous with a continuous
density function $f_h(x)$ such that $\sup_{h,x} f_h(x) < \infty$. Taking $\mathcal{G} = \mathcal{H}$, $\mathcal{U} = \Omega$, $\mathcal{O} = \mathcal{X}$,
and $Q(x,g) = -\log_2 f_g(x)$, the maximum likelihood $\epsilon$-approximate solution set
for $h$ is defined as

$$S(h,\epsilon) = \Big\{ g \in \mathcal{G} : -\int \log_2 f_g\, dP_h \le \inf_{g' \in \mathcal{G}} -\int \log_2 f_{g'}\, dP_h + \epsilon \Big\}. \tag{5}$$



As is evident in Definition 2.1 as well as the aforementioned examples,
determination of $\epsilon$-approximate $\theta$-solution sets depends on knowledge
of the underlying probability measure, $P_{\theta,h}$. If an oracle reveals $P_{\theta,h}$, then for
any accuracy $\epsilon > 0$, a decision-maker is able to find an action that belongs to
$S_\theta(h,\epsilon)$. In practice, however, such an oracle does not exist. Instead, we are
provided with some side information in the form of a sequence of observations
$o^n = (o_1, o_2, \ldots, o_n)$, and we seek an action $g \in \mathcal{G}$ whose risk, with high
probability, is within $\epsilon$ accuracy of the minimum risk for sufficiently large $n$. In other
words, for each $n$, it is of interest to take an action $g_n$ such that all but a finite
number of elements of the sequence $(g_n)_{n \in \mathbb{N}}$ lie, with arbitrarily high
probability, in $S_\theta(h,\epsilon)$. This requirement is called consistency and is expressed
more precisely as follows.

Definition 2.2 (Consistent decision policy). Let $A = (A_n)$ be a sequence
of decision rules $A_n : \mathcal{O}^n \to \mathcal{G}$ such that $A_n(o^n)$ is an element of $\mathcal{G}$. We call $A$
a uniformly $\epsilon$-consistent decision policy if

$$\limsup_{n \to \infty} p(A, \epsilon, n) = 0, \tag{6}$$

where

$$p(A,\epsilon,n) \triangleq \sup_{\theta \in \Theta} \sup_{h \in \mathcal{H}} P_\theta\big(\omega \in \Omega : A_n(O^n(\omega)) \notin S_\theta(h,\epsilon)\big) \tag{7}$$

is the worst case probability that the risk associated with decision rule $A_n$ is
not within $\epsilon$ accuracy of the minimal risk. Correspondingly, $A$ is said to be
uniformly consistent if it is uniformly $\epsilon$-consistent for every $\epsilon > 0$. Moreover,
if instead of convergence in probability, almost sure convergence occurs, the
decision policy is said to be strongly consistent.
Definition 2.2 provides important tools to assess the performance of a
decision policy. For example, for a system identification algorithm, convergence in
the sense of (6) implies that for every desired level of accuracy, there exists $n_0$
such that if the length of side information is $n > n_0$, the algorithm attains an
estimate of the system's state within the desired accuracy with arbitrarily high
probability. Intuitively, the smaller $n_0$ is, the more efficiently the algorithm
uses the side information.
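
The failure probability in Definition 2.2 can be explored numerically. The sketch below is a Monte Carlo estimate for a single, fixed state and distribution rather than the supremum over the whole model, and every modeling choice in it (Gaussian noise, squared-error loss, the sample mean as decision rule, the action set [-3, 3], the name `failure_rate`) is an assumption made purely for illustration.

```python
import numpy as np

def failure_rate(n, eps, h=1.0, sigma=1.0, trials=2000, seed=0):
    """Monte Carlo estimate of the probability that the chosen action misses the eps-solution set."""
    rng = np.random.default_rng(seed)
    # Hypothetical model: observations o_i = h + noise, squared-error loss.
    obs = h + sigma * rng.standard_normal((trials, n))
    # Decision rule: sample mean, clipped to the compact action set [-3, 3].
    actions = np.clip(obs.mean(axis=1), -3.0, 3.0)
    # Under squared-error loss the excess risk of action g is (g - h)^2.
    excess_risk = (actions - h) ** 2
    return float(np.mean(excess_risk > eps))

# Estimated failure probability for increasing lengths of side information.
rates = {n: failure_rate(n, eps=0.05) for n in (10, 100, 1000)}
```

For this toy model the estimated failure rate should shrink as the length of the side information grows, which is the behavior that (6) formalizes.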

3 Empirical Risk Minimization 

Any sequence of decision rules that satisfies the conditions of Definition 2.2 is 
a consistent decision policy. However, verification of its consistency and
investigation of its convergence behavior may not be theoretically possible. As a
result, much of the work in the literature, in particular in learning theory, has
concentrated on a certain form of decision policy that is based on empirical risk
minimization (ERM), denoted by $A_{\mathrm{ERM}}$ [4].

Let $o^n = O^n(\omega)$ denote the beginning segment of a realization of the side
information process. The empirical risk function is defined as

$$J^{(n)}(g) \triangleq \frac{1}{n} \sum_{i=1}^{n} Q(o_i, g). \tag{8}$$

Let $\nu^{(n)} = \inf_{g \in \mathcal{G}} J^{(n)}(g)$. Decision rule $A_n$ takes an action

$$A_n(o^n) \in \{ g \in \mathcal{G} : J^{(n)}(g) = \nu^{(n)} \}, \tag{9}$$

as the image of side information $o^n$ in the action space. We now state and prove
a result that indicates that to prove $\epsilon$-consistency of the ERM decision policy,
it suffices to prove the uniform convergence of the empirical risk functions.
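
A minimal sketch of the ERM decision rule follows, assuming the compact action space is replaced by a finite grid and the loss accepts NumPy arrays. The function names, the Gaussian observations, and the grid are illustrative assumptions, not part of the formal model.

```python
import numpy as np

def empirical_risk(loss, observations, actions):
    """Average loss over the sample for every candidate action (vectorized)."""
    obs = np.asarray(observations, dtype=float)[:, None]
    acts = np.asarray(actions, dtype=float)[None, :]
    return loss(obs, acts).mean(axis=0)

def erm_action(loss, observations, actions):
    """Return one minimizer of the empirical risk over a finite action grid."""
    risks = empirical_risk(loss, observations, actions)
    best = int(np.argmin(risks))
    return float(np.asarray(actions)[best]), risks

rng = np.random.default_rng(0)
observations = rng.normal(loc=1.0, scale=1.0, size=2000)  # hypothetical side information
actions = np.linspace(-3.0, 3.0, 121)                     # grid over a compact action set
g_hat, risks = erm_action(lambda o, g: (o - g) ** 2, observations, actions)
```

With squared-error loss the empirical risk is minimized at the grid point closest to the sample mean, so `g_hat` tracks the sample mean up to the grid resolution. When several grid points tie, `np.argmin` returns the first, which is one admissible choice among the minimizers.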

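Lemma 3.1. For every $\theta \in \Theta$, $h \in \mathcal{H}$, and $\epsilon > 0$, the following inequality holds
true:

$$P_\theta\big(\omega \in \Omega : A_n(O^n(\omega)) \notin S_\theta(h,\epsilon)\big) \le P_\theta\Big(\omega \in \Omega : \sup_{g \in \mathcal{G}} |J_\theta(h,g) - J^{(n)}(g)| > \epsilon/2\Big). \tag{10}$$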

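Proof. Let $J_\theta^\circ(h) = \inf_{g \in \mathcal{G}} J_\theta(h,g)$. One can verify that

$$\Psi_{n,\theta}(h,\epsilon,\mathcal{G}) \triangleq \{\omega \in \Omega : A_n(O^n(\omega)) \notin S_\theta(h,\epsilon)\}$$
$$= \{\omega \in \Omega : \exists g \in \mathcal{G},\ J^{(n)}(g) = \nu^{(n)},\ J_\theta(h,g) > J_\theta^\circ(h) + \epsilon\}.$$

Define

$$A_n(\epsilon) = \{\omega \in \Omega : J_\theta^\circ(h) - \nu^{(n)} \ge -\epsilon/2\}$$

and its complement $A_n^c(\epsilon) = \Omega \setminus A_n(\epsilon)$. One can verify that

$$\Psi_{n,\theta}(h,\epsilon,\mathcal{G}) \subseteq \big(\Psi_{n,\theta}(h,\epsilon,\mathcal{G}) \cap A_n(\epsilon)\big) \cup A_n^c(\epsilon). \tag{11}$$

Consider the event $\Psi_{n,\theta}(h,\epsilon,\mathcal{G}) \cap A_n(\epsilon)$ and let $\omega$ be an outcome belonging to
this event. By definition, there exists a $g \in \mathcal{G}$ that satisfies the conditions of
$\Psi_{n,\theta}(h,\epsilon,\mathcal{G})$. We can verify that for outcome $\omega$, the inequality

$$J_\theta(h,g) - J^{(n)}(g) > \epsilon + J_\theta^\circ(h) - \nu^{(n)} \ge \epsilon/2$$

holds true. As a result, for every outcome $\omega \in \Psi_{n,\theta}(h,\epsilon,\mathcal{G}) \cap A_n(\epsilon)$, we have

$$\sup_{g \in \mathcal{G}} \big(J_\theta(h,g) - J^{(n)}(g)\big) > \epsilon/2, \tag{12}$$

which implies that

$$\Psi_{n,\theta}(h,\epsilon,\mathcal{G}) \cap A_n(\epsilon) \subseteq \Big\{\omega \in \Omega : \sup_{g \in \mathcal{G}} \big(J_\theta(h,g) - J^{(n)}(g)\big) > \epsilon/2\Big\}. \tag{13}$$

Now consider an outcome $\omega \in A_n^c(\epsilon)$ and note that

$$\nu^{(n)} - J_\theta^\circ(h) > \epsilon/2 \tag{14}$$

holds for this outcome. Suppose $J_\theta(h,\cdot)$ attains its minimum at a point $g^\circ \in \mathcal{G}$
with the minimum value $J_\theta^\circ(h) = J_\theta(h,g^\circ)$. Noting that $J^{(n)}(g^\circ) \ge \nu^{(n)}$, we
have $J^{(n)}(g^\circ) - J_\theta(h,g^\circ) > \epsilon/2$, which implies $\sup_{g \in \mathcal{G}} \big(J^{(n)}(g) - J_\theta(h,g)\big) > \epsilon/2$.
As a result, we deduce that

$$A_n^c(\epsilon) \subseteq \Big\{\omega \in \Omega : \sup_{g \in \mathcal{G}} \big(J^{(n)}(g) - J_\theta(h,g)\big) > \epsilon/2\Big\}. \tag{15}$$

Using (11) along with (13) and (15), we can conclude the assertion. Note that
the right hand sides (RHS) of both (13) and (15) are measurable sets as we
assumed both $\mathcal{G}$ and $\mathcal{Q}$ are separable. Hence, in the asserted inequality, the outer
measure in the left hand side (LHS) is replaced by measure in the RHS. □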

To investigate the consistency of the ERM policy, it suffices to show that
the right hand side (RHS) of (10) converges to zero as $n$ grows to infinity. For
the case of bounded loss functions, it is known that to have convergence, it is
sufficient that the VC-dimension (to be described shortly) of the class of loss
functions is finite [3], [4], [5]. Hence, provided that the class of loss functions
has a finite VC-dimension, the decision policy is universally (distribution-free)
consistent and the convergence has an exponential behavior. For the case of
unbounded loss functions, however, universal consistency does not hold, and
additional analytical restrictions should be applied to the class of loss functions,
$\mathcal{Q}$, and the collection of probability measures, $\mathcal{P}_{\Theta,\mathcal{H}}$.
3.1 Growth function



The uniform convergence property is closely dependent on the existence of certain
simplifying structures on $\mathcal{Q}$ that restrict the growth function of the class of loss
functions [4].

Let $L = \inf_{o,g} Q(o,g)$. For every $c > 0$ and every $Q(\cdot,g)$, define a function
$Q_c : \mathcal{O} \times \mathcal{G} \to \mathbb{R}$ such that

$$Q_c(o,g) = \min\{Q(o,g),\, c + L\}. \tag{16}$$

Let $\mathcal{Q}_c = \{Q_c(\cdot,g) : Q(\cdot,g) \in \mathcal{Q}\}$ denote the class of truncated loss functions.
For every pair $c$ and $g$, let $\mathcal{O}_{c,g}(\alpha) = \{o \in \mathcal{O} : Q_c(o,g) > \alpha\}$ denote the level
set of $Q_c(o,g)$ corresponding to level $\alpha > L$. Let $\mathbb{I}$ denote the binary indicator
function. Suppose a sequence of observations $o^n = (o_1, o_2, \ldots, o_n)$ is given.
Define

$$\mathcal{Q}_c(o^n) = \big\{(v_{1,\alpha}, \ldots, v_{n,\alpha}) \in \{0,1\}^n : v_{i,\alpha} = \mathbb{I}(o_i \in \mathcal{O}_{c,g}(\alpha)),\ \forall \alpha > L,\ g \in \mathcal{G}\big\}. \tag{17}$$

Intuitively, $\mathcal{Q}_c(o^n)$ is the set of all distinct vertices of $\{0,1\}^n$ that are spanned
by the image of the observation vectors under the characteristic functions of the
level sets. Let $N(\mathcal{Q}_c, o^n)$ denote the cardinality of $\mathcal{Q}_c(o^n)$. The growth function
of $\mathcal{Q}_c$ is defined as

$$G(\mathcal{Q}_c, n) = \max_{o^n} \ln N(\mathcal{Q}_c, o^n). \tag{18}$$

The VC-dimension, named after Vapnik and Chervonenkis, of $\mathcal{Q}_c$ is defined as

$$d_c = \max\{n \in \mathbb{N} : G(\mathcal{Q}_c, n) = n \ln 2\} \tag{19}$$

and described as follows. For a sequence of observations $o^n$, let $B$ denote the set
of its elements. $B$ is said to be shattered by $\mathcal{Q}_c$ if for every subset $E$ of its elements,
there exist $\alpha$ and $g$ such that for every $o \in E$, $Q_c(o,g) > \alpha$, and for every
element $o \in B - E$, $Q_c(o,g) \le \alpha$. The VC-dimension of $\mathcal{Q}_c$ is the cardinality of
the largest set $B$ that can be shattered by $\mathcal{Q}_c$. If $\mathcal{Q}_c$ has a finite VC-dimension
$d_c$, then it is known that [4, Thm. 4.3]

$$G(\mathcal{Q}_c, n) = n \ln 2, \quad \text{if } n \le d_c,$$
$$G(\mathcal{Q}_c, n) \le d_c \ln \frac{en}{d_c}, \quad \text{otherwise}. \tag{20}$$

If $d_c$ is uniformly bounded over $c$, then $d = \lim_{c \to \infty} d_c$ and

$$G(\mathcal{Q}, n) = \limsup_{c \to \infty} G(\mathcal{Q}_c, n)$$

denote the VC-dimension and the growth function of $\mathcal{Q}$, respectively.
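
To make the counting behind the growth function concrete, the sketch below enumerates the distinct indicator vectors produced by level sets of a truncated loss on a small sample. It assumes a one-dimensional squared-error loss with L = 0 and searches only finite grids of actions and levels, so the count it returns is in general a lower bound on the true number of vertices. The function name and all numerical values are hypothetical.

```python
import numpy as np

def dichotomies(observations, actions, levels, c):
    """Distinct vectors (1[Q_c(o_i, g) > alpha])_i found on a finite grid of (g, alpha)."""
    obs = np.asarray(observations, dtype=float)
    patterns = set()
    for g in actions:
        truncated = np.minimum((obs - g) ** 2, c)  # truncated squared-error loss, L = 0
        for alpha in levels:
            patterns.add(tuple((truncated > alpha).astype(int).tolist()))
    return patterns

# Hypothetical grids: truncation at c = 9 and four observations on the real line.
obs = [0.0, 1.0, 2.0, 3.0]
actions = np.linspace(-1.0, 4.0, 51)
levels = np.linspace(0.01, 9.0, 200)
n_patterns = len(dichotomies(obs, actions, levels, c=9.0))
log_count = float(np.log(n_patterns))          # grid-restricted version of ln N
shattered = n_patterns == 2 ** len(obs)        # True only if every dichotomy is realized
```

For this loss each level set is the complement of an interval centered at the action, so only contiguous blocks of observations can be excluded. This is why four points on a line cannot be shattered, while any two distinct points can.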

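Lemma 3.2. For an action $g$, let $J_{c,\theta}(h,g)$ and $J_c^{(n)}(g)$ denote the risk value
and empirical risk value that are obtained using the truncated loss functions,
$Q_c(\cdot,g)$, respectively. Then,

$$P_\theta\Big(\omega \in \Omega : \sup_{g \in \mathcal{G}} |J_{c,\theta}(h,g) - J_c^{(n)}(g)| > \epsilon/2\Big) \le 4 \exp\Big\{\Big(\frac{G(\mathcal{Q}_c, 2n)}{n} - \frac{\epsilon_*^2}{4c^2}\Big) n\Big\} \tag{21}$$

for every $n$ above a threshold (the threshold expression is illegible in the source), where $\epsilon_* = \epsilon - 2/n$.

Proof. The proof follows by [4, Theorem 5.1] where the lower bound on $n$ is
imposed by a well known result on symmetrization [6, Lemma 11.5]. □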

Lemma 3.2 provides a sufficient condition for uniform convergence for the
bounded class of truncated loss functions $\mathcal{Q}_c$. It is seen that to have uniform
convergence, it suffices that $\limsup_{n \to \infty} G(\mathcal{Q}_c, 2n)/n = 0$. By (20), if the
VC-dimension of $\mathcal{Q}_c$ is finite, then (21) converges to zero and uniform convergence
occurs. A generalization of VC-dimension is also known to be both necessary
and sufficient for uniform convergence [5]. Note that the asymptotic uniform
convergence behavior for $\mathcal{Q}_c$ is $O(e^{-\gamma_c n^{q_c}})$, where the decay exponent is $q_c = 1$
and the decay constant is $\gamma_c = \frac{\epsilon^2}{4c^2}$.



3.2 Dominating function 

To generalize Lemma 3.2 to unbounded loss functions, let $M : \mathcal{O} \to \mathbb{R}^+$ be a
measurable function such that for all $o \in \mathcal{O}$ and $g \in \mathcal{G}$, $Q(o,g) - L \le M(o)$.
Such a function is called a dominating function for $\mathcal{Q}$. To have exponential
convergence in (10), we need to apply some restricting conditions on $M$.

Definition 3.1 (Orlicz expectation). Let $\psi : \mathbb{R}^+ \to \mathbb{R}^+$ be a measurable,
nondecreasing function. For an underlying probability measure $P$, the quantity

$$\|M\|_\psi \triangleq \inf\Big\{ c > 0 : \int \psi(|M|/c)\, dP \le 1 \Big\} \tag{22}$$

is called the Orlicz expectation of $M$. Moreover, if $\psi$ is convex, then $\|M\|_\psi$ is
called the Orlicz norm of $M$.$^1$

$^1$ One special class of Orlicz norms is the class of $L_p$-norms.

To emphasize the tail behavior of density functions, we consider a special
form of functions $\psi_p : \mathbb{R} \to \mathbb{R}^+$, $p > 0$, such that $\psi_p(x) = e^{|x|^p} - 1$. The following
result, which is adopted from [7] and generalized for any $p > 0$, motivates the
usage of such functions. We call $p$ and $\|M\|_{\psi_p}$ the exponential expectation order
and the exponential Orlicz expectation of $M$ of order $p$, respectively.

Lemma 3.3. For any probability measure $P \in \mathcal{P}_{\Theta,\mathcal{H}}$ and any $p > 0$, the
following are equivalent:

1. $\|M\|_{\psi_p} < \infty$.

2. There exist constants $0 < R, S < \infty$ such that

$$P(|M| > c) \le R e^{-S c^p}, \quad \text{for all } c > 0. \tag{23}$$

Moreover, if either condition holds, then $R = 2$ and $S = \|M\|_{\psi_p}^{-p}$ satisfy (23).

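Proof. Suppose the first statement holds. Then, by Markov's inequality, we
have

$$P(|M| > c) \le P\big(\psi_p(|M| / \|M\|_{\psi_p}) \ge \psi_p(c / \|M\|_{\psi_p})\big)$$
$$\le \min\Big\{1, \frac{1}{\psi_p(c / \|M\|_{\psi_p})}\Big\} \le 2 e^{-\|M\|_{\psi_p}^{-p} c^p}.$$

Now, suppose the second statement holds. Using Fubini's theorem [8] to
exchange the order of integration, we obtain

$$\int \big(e^{c^{-p}|M|^p} - 1\big)\, dP = \int \int_0^{c^{-p}|M|^p} e^x\, dx\, dP$$
$$= \int_0^\infty P(|M| > c\, x^{1/p})\, e^x\, dx$$
$$\le \int_0^\infty R e^{-S c^p x} e^x\, dx = \frac{R}{S c^p - 1},$$

which means that for $c \ge \big(\frac{1+R}{S}\big)^{1/p}$ the integral in the LHS is no larger than one.
This indicates that $\|M\|_{\psi_p} \le \big(\frac{1+R}{S}\big)^{1/p} < \infty$. □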

In other words, Lemma 3.3 means that to have the probability of the level
sets of the dominating function converge to zero exponentially, it is necessary
and sufficient that $\|M\|_{\psi_p} < \infty$ for some $p > 0$. Intuitively, Lemma 3.3 suggests
that to obtain an exponential upper bound on (10), it suffices to have $\|M\|_{\psi_p}$
uniformly bounded for some $p > 0$.
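
The exponential Orlicz expectation can be approximated numerically for an empirical measure. The following sketch finds, by bisection, the smallest scale at which the sample average of exp(|M/c|^p) - 1 does not exceed one, and then compares empirical tail frequencies with the bound of Lemma 3.3 using R = 2. The function name `exp_orlicz`, the tolerance, and the exponential test data are assumptions for illustration.

```python
import numpy as np

def exp_orlicz(samples, p, tol=1e-8):
    """Empirical exponential Orlicz expectation of order p, computed by bisection."""
    m = np.abs(np.asarray(samples, dtype=float))
    if m.max() == 0.0:
        return 0.0  # degenerate case: the infimum over c > 0 is zero

    def feasible(c):
        # True when the sample average of exp((|M|/c)^p) - 1 is at most one.
        with np.errstate(over="ignore"):
            return np.mean(np.expm1((m / c) ** p)) <= 1.0

    hi = m.max()
    while not feasible(hi):        # grow until the constraint holds
        hi *= 2.0
    lo = hi / 2.0
    while feasible(lo):            # shrink until the constraint fails
        hi, lo = lo, lo / 2.0
    while hi - lo > tol * hi:      # bisection keeps hi feasible and lo infeasible
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(0)
samples = rng.exponential(scale=1.0, size=5000)   # hypothetical dominating-function values
norm = exp_orlicz(samples, p=1.0)
# Empirical tail frequencies against the bound 2 * exp(-(c / norm)^p).
tail_ok = all(np.mean(samples > c) <= 2.0 * np.exp(-(c / norm) ** 1.0) + 1e-12
              for c in (0.5, 1.0, 2.0, 4.0))
```

The bisection relies on the constraint being monotone in the scale, which holds because the integrand is nonincreasing in c. For heavy-tailed data the sample value can be an unstable estimate of the population quantity, so it should be read as a diagnostic rather than a certified bound.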

3.3 Sufficient condition 

Using Lemmas 3.2 and 3.3, we now state the key result of this work. 

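Theorem 3.1. For a given class of loss functions $\mathcal{Q}$ and the collection of
probability measures $\mathcal{P}_{\Theta,\mathcal{H}}$, suppose there exists a dominating function $M$, $p > 0$,
and $\rho < \infty$ such that $\sup_{P_{\theta,h} \in \mathcal{P}_{\Theta,\mathcal{H}}} \|M\|_{\psi_p} = \rho$. Then,

1. The following inequality holds true

$$\sup_{\theta \in \Theta} \sup_{h \in \mathcal{H}} P_\theta\Big(\omega \in \Omega : \sup_{g \in \mathcal{G}} |J_\theta(h,g) - J^{(n)}(g)| > \epsilon/2\Big)$$
$$\le \big(2n + 4 \exp\{G(\mathcal{Q}_{c(n)}, 2n)\}\big) \exp\Big\{ -\Big(1 - \frac{4}{\epsilon n}\Big)^2 \Big(\frac{\epsilon^2}{16\rho^2}\Big)^{q} n^{q} \Big\} \tag{24}$$

for every $n$ above a threshold (a maximum of two terms whose expressions are illegible in the source),
where $c(n) = 16^{-\frac{1}{2+p}} \rho^{\frac{p}{2+p}} \epsilon^{\frac{2}{2+p}} n^{\frac{1}{2+p}}$
and $q = \frac{p}{2+p}$.

2. If $\limsup_{n \to \infty} \frac{G(\mathcal{Q}_{c(n)}, 2n)}{n^q} = 0$, then the empirical risk minimization
approach is consistent.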

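Proof. Without loss of generality, we may assume that $L = 0$; otherwise,
take $\tilde{Q} = Q - L$. For a given loss function $Q$ and a positive value $c > 0$, let $Q_c$
be defined as in (16) and $Q^c = Q - Q_c$. Moreover, let $J_{c,\theta}(h,g) = \int Q_c(o,g)\, dP_{\theta,h}$
and $J^c_\theta(h,g) = \int Q^c(o,g)\, dP_{\theta,h}$. Let $J_c^{(n)}(g)$ and $J^{c,(n)}(g)$ be defined similarly.
One can verify that for every $c > 0$, we have

$$P_\theta\Big(\omega \in \Omega : \sup_{g \in \mathcal{G}} |J_\theta(h,g) - J^{(n)}(g)| > \epsilon/2\Big)$$
$$\le P_\theta\Big(\omega \in \Omega : \sup_{g \in \mathcal{G}} |J_{c,\theta}(h,g) - J_c^{(n)}(g)| > \epsilon/4\Big)$$
$$+ P_\theta\Big(\omega \in \Omega : \sup_{g \in \mathcal{G}} |J^c_\theta(h,g) - J^{c,(n)}(g)| > \epsilon/4\Big). \tag{25}$$

For the dominating function $M$, define functions $M_c = \min(M, c)$ and $M^c =
M - M_c$. For the second term of (25), we note that

$$P_\theta\Big(\omega \in \Omega : \sup_{g \in \mathcal{G}} |J^c_\theta(h,g) - J^{c,(n)}(g)| > \epsilon/4\Big)$$
$$\le P_\theta\Big(\omega \in \Omega : \int M^c\, dP_{\theta,h} > \epsilon/8\Big) + P_\theta\Big(\omega \in \Omega : \int M^c\, dP^{(n)} > \epsilon/8\Big). \tag{26}$$

It can be verified that

$$P_\theta\Big(\omega \in \Omega : \int M^c\, dP_{\theta,h} > \epsilon/8\Big) \le P_\theta\Big(\omega \in \Omega : \int_{M(o) > c} M\, dP_{\theta,h} > \epsilon/8\Big)$$
$$= P_\theta\Big(\omega \in \Omega : \int_0^\infty P_{\theta,h}(M > c + x)\, dx > \epsilon/8\Big)$$
$$\text{(By Lemma 3.3)} \quad \le P_\theta\Big(\omega \in \Omega : \int_0^\infty 2 e^{-\rho^{-p}(c+x)^p}\, dx > \epsilon/8\Big)$$
$$\le P_\theta\big(\omega \in \Omega : [\text{closed-form bound on the integral, illegible in the source}] > \epsilon/8\big),$$

where by taking $c > c_{\mathrm{th}}$ (the expression for the threshold $c_{\mathrm{th}}$ is illegible in the source), we enforce the LHS to be zero. Now,
consider the second term of the RHS of (26). We have

$$P_\theta\Big(\omega \in \Omega : \int M^c\, dP^{(n)} > \epsilon/8\Big) \le n P_\theta\big(\omega \in \Omega : M(O(\omega)) > c + \epsilon/8\big)$$
$$\text{(By Lemma 3.3)} \quad \le 2n \exp(-\rho^{-p} c^p). \tag{27}$$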
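Moreover, by applying Lemma 3.2 to the first term of the RHS of (25), we have

$$P_\theta\Big(\omega \in \Omega : \sup_{g \in \mathcal{G}} |J_{c,\theta}(h,g) - J_c^{(n)}(g)| > \epsilon/4\Big) \le 4 \exp\Big\{\Big(\frac{G(\mathcal{Q}_c, 2n)}{n} - \frac{\epsilon_*^2}{16c^2}\Big) n\Big\} \tag{28}$$

for every $n$ above the threshold of Lemma 3.2 (illegible in the source), where $\epsilon_* = \epsilon - 4/n$. Provided that $c > c_{\mathrm{th}}$, the asymptotic
behavior of the upper bound is controlled either by (27) or (28). To obtain the
tightest bound, we pick $c$ such that $\rho^{-p} c^p \approx \frac{\epsilon^2 n}{16 c^2}$. As a result, we obtain

$$c = 16^{-\frac{1}{2+p}} \rho^{\frac{p}{2+p}} \epsilon^{\frac{2}{2+p}} n^{\frac{1}{2+p}}.$$

Note that for the case of bounded functions, as $p$ increases, $c$ goes to $\rho$. For the
general case where $p < \infty$, we need to have $c > c_{\mathrm{th}}$, which requires $n$ to exceed
a lower bound whose expression is illegible in the source. Moreover, substituting this $c$
in the condition on $n$ from Lemma 3.2, one can verify that for every $n$ above the maximum
of the two resulting lower bounds (each a multiple of $\frac{16\rho^2}{\epsilon^2}$; the remaining factors
are illegible in the source), the following inequality holds true

$$\sup_{\theta \in \Theta} \sup_{h \in \mathcal{H}} P_\theta\Big(\omega \in \Omega : \sup_{g \in \mathcal{G}} |J_\theta(h,g) - J^{(n)}(g)| > \epsilon/2\Big) \le$$
$$\big(2n + 4 \exp\{G(\mathcal{Q}_c, 2n)\}\big) \exp\Big\{ -\Big(1 - \frac{4}{\epsilon n}\Big)^2 \Big(\frac{\epsilon^2}{16\rho^2}\Big)^{q} n^{q} \Big\},$$

where $q = \frac{p}{2+p}$.

The proof of the second assertion follows by the fact that the given
hypothesis provides a sufficient condition for convergence of the RHS of (24) to
zero. This convergence means that the ERM policy is $\epsilon$-consistent. Because
$c = 16^{-\frac{1}{2+p}} \rho^{\frac{p}{2+p}} \epsilon^{\frac{2}{2+p}} n^{\frac{1}{2+p}}$ is an increasing function with respect to $\epsilon$, $G(\mathcal{Q}_c, n)$
is non-increasing as $\epsilon$ decreases. This implies the second assertion. □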

Theorem 3.1 embodies key results in assessing uniform convergence and its
behavior for decision-making with unbounded loss functions. Equation (24) is
an upper bound on the worst case probability of taking an action whose risk is
not within $\epsilon$ accuracy of the minimal risk. Thus, convergence of (24) to zero
implies $\epsilon$-consistency of the decision policy. By the second assertion, convergence
occurs provided that the collection of truncated loss functions, $\mathcal{Q}_{c(n)}$, satisfies

$$\limsup_{n \to \infty} \frac{G(\mathcal{Q}_{c(n)}, 2n)}{n^q} = 0, \tag{29}$$

where $c(n)$ is described as in the assertion.
By (20), if $\mathcal{Q}$ has a finite VC-dimension, then $G(\mathcal{Q}_{c(n)}, 2n) \le G(\mathcal{Q}, 2n) =
O(\ln n)$, implying that (29) holds true for any $q > 0$. This implies that for an
unbounded class of loss functions with finite VC-dimension, the ERM policy is
consistent if there exist a dominating function $M$ and an exponential expectation
order $p > 0$ such that $\sup_{P_{\theta,h} \in \mathcal{P}_{\Theta,\mathcal{H}}} \|M\|_{\psi_p} < \infty$. By the Borel-Cantelli Lemma
[9], these conditions are also sufficient for strong consistency of the ERM policy.

Upon satisfaction of (29), by (24), it appears that the asymptotic behavior of
the convergence is described by $O(e^{-\gamma n^q})$, where $\gamma = \big(\frac{\epsilon^2}{16\rho^2}\big)^{q}$ is the decay constant
and $q$ is the decay exponent for the case of unbounded loss functions. The decay
exponent, $q = \frac{p}{2+p}$, is an increasing function of $p$. We note that $q < 1$, meaning
that the decay exponent for unbounded loss functions is strictly smaller than
one, which is in clear contrast to the case of bounded loss functions as seen
in (21). On the other hand, the decay constant $\gamma$ is very similar to the
case of bounded loss functions, where the constant $c$ is replaced by $\rho$. Clearly,
in both cases, as either $c$ or $\rho$ increases, the decay constant decreases, imposing
a slower convergence.

Given a parameter $\beta \in [0, 1]$ and a desired level of accuracy $\epsilon$, the sample
complexity of the ERM policy is defined as

$$n(\epsilon, \beta) = \inf\{n \in \mathbb{N} : p(A_{\mathrm{ERM}}, \epsilon, n) \le \beta\}. \tag{30}$$

Sample complexity determines the number of observations needed by the ERM
decision policy to take actions whose risk is within accuracy $\epsilon$ of the minimal risk
with a confidence of at least $1 - \beta$. Assuming that the class of unbounded loss
functions, $\mathcal{Q}$, has a finite VC-dimension of size $d$, we obtain $n(\epsilon, \beta) \le n_0$, where
$n_0$ is given by

[Equation (31), the expression for $n_0$, is illegible in the source.]

Equation (31) is a novel expression that demonstrates the effect of different
parameters in determining the sample complexity. It can be seen that the
accuracy $\epsilon$ and the least upper bound on the exponential expectation $\rho$ have a more
dramatic impact on $n_0$ compared to the confidence parameter $\beta$. Moreover,
as the decay exponent, $q$, increases, the sample complexity decreases. Due to the
dependency of both $q$ and $\rho$ on the tail behavior of the density functions, it is
expected that densities with heavier tails result in smaller $q$ and larger $\rho$, hence
larger sample complexity.

4 Applications

As mentioned earlier, we adopted a decision-making framework since it is a
generic representation for different applications. To further clarify how the
derived results can be used in practice, we now use this model in some example
applications.
4.1 Method of moments



Parameter estimation through the method of moments is a realization of
substitution estimators that is described as follows [10, Ch. II]. Let $\mathcal{X}$ be a subset
of a Euclidean space that denotes the observation space. Let $(\Omega, P)$ be a
probability space. Let $\mathcal{H} \subset \mathbb{R}^l$ be a compact, convex subset for some integer
$l > 0$ such that for every $h \in \mathcal{H}$, $P_h$ is a probability distribution of a random
variable $X_h : \Omega \to \mathcal{X}$ whose distribution is induced by $P$. The collection of such
induced probability distributions is denoted by $\mathcal{P}_{\mathcal{H}}$. It is assumed that there
exists a measurable function $\phi : \mathcal{X} \to \mathbb{R}^l$ such that for an underlying probability
measure $P_h \in \mathcal{P}_{\mathcal{H}}$, one obtains $h$ through a multidimensional integration as
follows

$$h = \int \phi(x)\, dP_h. \tag{32}$$

For example, suppose $\mathcal{X} = \mathbb{R}$ and function $\phi$ is described as

$$\phi(x) = x^m, \quad \text{for some } m \in \mathbb{N},$$

which implies that parameter $h$ is the $m$-order moment of the underlying
distribution. In practice, since a sequence of observations $x^n = (x_1, x_2, \ldots, x_n)$ is
given instead of $P_h$, an estimate of $h$ is obtained by substituting an empirical
measure $P^{(n)}$ in (32) that generates an estimate

$$\hat{h}_n = \int \phi(x)\, dP^{(n)} = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i). \tag{33}$$

It can be seen that $\hat{h}_n$ is not necessarily in $\mathcal{H}$ for all observed sequences. In such
incidences, the estimator picks a boundary point $\hat{h}^{(n)} = \arg\min_{h \in \mathcal{H}} \|h - \hat{h}_n\|^2$ as
the estimate. For an $\epsilon > 0$, the estimator is said to be $\epsilon$-consistent if

$$\limsup_{n \to \infty} \sup_{h \in \mathcal{H}} P_h\big(\|\hat{h}^{(n)} - h\|^2 > \epsilon\big) = 0. \tag{34}$$

Moreover, if

$$\sup_{h \in \mathcal{H}} P_h\Big(\limsup_{n \to \infty} \|\hat{h}^{(n)} - h\|^2 > \epsilon\Big) = 0, \tag{35}$$

the estimator is said to be strongly $\epsilon$-consistent. In either case, if the estimator
is (strongly) $\epsilon$-consistent for every $\epsilon > 0$, it is said to be (strongly) consistent.
Sufficient conditions for consistency of these estimators have been previously
addressed in the literature [10, Ch. II.4]. Here, we demonstrate how similar
results can be easily obtained from the results of this work.
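
A minimal sketch of the substitution estimator with projection is given below. It assumes that the parameter set is an axis-aligned box, for which the Euclidean projection reduces to coordinate-wise clipping; general compact convex sets would need a different projection step. The moment map, the box bounds, and the uniform test data are hypothetical.

```python
import numpy as np

def moment_estimate(samples, phi, lower, upper):
    """Empirical moment estimate followed by projection onto the box [lower, upper]."""
    raw = phi(np.asarray(samples, dtype=float)).mean(axis=0)  # sample average of phi(x_i)
    projected = np.clip(raw, lower, upper)                    # Euclidean projection onto a box
    return projected, raw

# Hypothetical example: first two moments of a uniform distribution on [0, 1].
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=5000)
phi = lambda x: np.stack([x, x ** 2], axis=-1)
lower = np.array([0.0, 0.0])
upper = np.array([0.4, 1.0])   # deliberately excludes the true first moment
estimate, raw = moment_estimate(samples, phi, lower, upper)
```

Because the box was chosen to exclude the true first moment, the raw estimate falls outside it and the first coordinate of `estimate` lands on the boundary, which is the situation the projection step is meant to handle.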

To articulate this problem into the generic model that is used in this work,
let $\mathcal{G} = \mathcal{H}$ denote the action space, and let
$Q(x,g) = \|\phi(x) - g\|^2$ describe the loss function. The risk (error) of an
action (estimate) $g$ when the underlying parameter is $h$ is

\[ J(h,g) = \int \|\phi(x) - g\|^2 \, dP_h. \tag{36} \]

Let $(g_k)$ be any sequence in $\mathcal{G}$ such that $g_k \to g$. By Fatou's
Lemma [8],

\[ J(h,g) \le \liminf_{k \to \infty} J(h,g_k). \]

This means that $J(h,g)$ is lower semi-continuous; hence, Definition 2.1 is
applicable to the method of moments. Since $\mathcal{G}$ is convex and $Q(x,g)$
is strictly convex over $\mathcal{G}$, the solution of empirical risk
minimization, (9), is unique and equal to the solution obtained from the method
of moments. Thus, the notion of consistency defined in (35) is equivalent to
the one defined in Definition 2.2. As a result, the analytical arguments shown
in this work can be used to investigate consistency and exponential convergence
behavior of the method of moments. Since the VC-dimension of the class of loss
functions is upper bounded by $l + 1$, to have an exponential uniform
convergence, it suffices to find a dominating function whose exponential Orlicz
expectation (for some exponent $p > 0$) is uniformly bounded over
$\mathcal{P}_{\mathcal{H}}$. Thus, if there exist constants $a, b > 0$ such
that for every $P_h \in \mathcal{P}_{\mathcal{H}}$, the density function
satisfies $f_h(x) = O(e^{-a\|x\|^b})$, then the dominating function
$M(x) = 2\|\phi(x)\|^2 + 2\sup_{g \in \mathcal{G}} \|g\|^2$ satisfies the
condition of Theorem 3.1 for every $p < b/2$.

This means that to verify exponential convergence of the method of moments, one
can simply check the existence of constants $a, b$ as described above.
Furthermore, for a convenient exponential Orlicz expectation order $p$, by
estimating the value of the corresponding bound $\rho$, one can investigate the
decay rate and sample complexity of the method of moments. Similarly, using the
methodology shown here, one can articulate M-estimators [10] into the
decision-making framework, and use the results of this work to obtain important
insights about consistency and convergence behavior, and hence strong
consistency, of such estimators.

4.2 System identification

Example 2.1 presents a basic system identification problem where the risk
function is the mean square error. Suppose the probabilistic model
$\mathcal{P}_{\Theta}$ is a collection of probability measures such that each
measure $P_\theta \in \mathcal{P}_{\Theta}$ has a density function
$f_\theta(x,v) = k(\theta) e^{-\frac{1}{2}\langle x, \theta^{-1} x \rangle - \frac{1}{2}|v|^2}$,
where $\theta$ is a positive definite matrix of size $m + 1$ whose eigenvalues
are in the interval $[a, b]$ for $0 < a < b$ and
$k(\theta) = (2^{m+1} \pi^{m+1} \det\theta)^{-1/2}$. For a pair of underlying
$\theta$ and $h$, the density function of the induced probability measure,
$P_{\theta,h}$, on the space of observations, $\mathcal{X} \times \mathcal{Y}$,
is described by

\[ f_{\theta,h}(x,y) = k(\theta) e^{-\frac{1}{2}\langle x, \theta^{-1} x \rangle - \frac{1}{2}|y - \langle h, x \rangle|^2}. \]

Since $\mathcal{H}$ is compact, the collection of probability measures
$\mathcal{P}_{\Theta,\mathcal{H}}$ is a tight collection.

Recall that for a given observation pair $(x, y)$, the cost of taking action
$g \in \mathcal{G}$ is $Q((x,y),g) = |y - \langle g, x \rangle|^2$. Since
$\mathcal{G}$ is compact, $\sup_{g \in \mathcal{G}} \|g\| = A$ for some
$A < \infty$. Hence, one can simply define the dominating function as
$M(x,y) = 2|y|^2 + 2A^2\|x\|^2$. Thus, for any exponential expectation order
$p < 1$, it can be verified that
$\sup_{P \in \mathcal{P}_{\Theta,\mathcal{H}}} \|M\|_{\psi,p} = \rho$ for some
$\rho < \infty$. Since the VC-dimension of the class of loss functions is
$d = m + 1$ [4, Ch. 5], the second assertion of Theorem 3.1 holds true. This
means that for this setup the empirical minimum square error system
identification is consistent. Furthermore, by estimating the value of $\rho$,
the decay rate of convergence can be estimated.
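
The domination $Q((x,y),g) \le M(x,y)$ and the empirical minimum square error
rule can be checked on simulated data. The sketch below is illustrative only
and rests on assumptions that are not part of the setup above: inputs are
two-dimensional with identity covariance, the noise has unit variance, the
action set is the ball of radius A = 1, and the unconstrained least-squares
solution is assumed to fall inside that ball. All names are hypothetical.

```python
import random

def loss(x, y, g):
    """Squared prediction error Q((x, y), g) = |y - <g, x>|^2."""
    return (y - sum(gi * xi for gi, xi in zip(g, x))) ** 2

def dominating(x, y, A):
    """Dominating function M(x, y) = 2|y|^2 + 2 A^2 ||x||^2."""
    return 2.0 * y * y + 2.0 * A * A * sum(xi * xi for xi in x)

def erm_least_squares_2d(data):
    """Minimize the empirical mean square error via the 2x2 normal equations."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for (x1, x2), y in data:
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        b1 += x1 * y
        b2 += x2 * y
    det = s11 * s22 - s12 * s12
    # Cramer's rule for [[s11, s12], [s12, s22]] g = [b1, b2].
    return [(s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det]

# Hypothetical system: y = <h_true, x> + v with standard Gaussian x and v.
rng = random.Random(0)
h_true = [0.6, -0.4]
A = 1.0  # assumed radius of the compact action set
data = []
for _ in range(5000):
    x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    v = rng.gauss(0.0, 1.0)
    data.append((x, h_true[0] * x[0] + h_true[1] * x[1] + v))

g_hat = erm_least_squares_2d(data)
# Pointwise check that the loss at g_hat never exceeds M on the sample.
dominated = all(loss(x, y, g_hat) <= dominating(x, y, A) for x, y in data)
```

The pointwise check follows from $|y - \langle g, x \rangle|^2 \le 2|y|^2 +
2\|g\|^2\|x\|^2$ whenever $\|g\| \le A$, which is the inequality behind the
choice of $M$.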

4.3 Maximum likelihood estimation 

Example 2.2 articulates a generic maximum likelihood estimation problem in
the framework of decision-making. Let $\mathcal{X}$ be the positive cone in
$\mathbb{R}^m$, and let $\mathcal{H} = [a, b]^m$ where $0 < a < b < \infty$.
For probability measure $P_h \in \mathcal{P}_{\mathcal{H}}$, let

\[ f_h(x) = k(h) e^{-\langle h, x \rangle} \]

describe its density function, where $k(h) = \prod_{i=1}^{m} h_i$. As a result,
the cost of taking action $g$ for observation $x$ is
$Q(x,g) = -\ln k(g) + \langle g, x \rangle$. Taking the dominating function as

\[ M(x) = m(|\ln a| + |\ln b|) + \sqrt{m}\, b \|x\|, \]

one can verify that
$\sup_{P \in \mathcal{P}_{\mathcal{H}}} \|M\|_{\psi,p} < \infty$ for every
exponential expectation order $p < 1$. Since the VC-dimension of the class of
loss functions is $d = m$, by Theorem 3.1, the empirical maximum likelihood
estimation is consistent for this setup. With some additional effort to
estimate $\rho$, one would be able to obtain important insights regarding the
convergence behavior of the maximum likelihood estimation.
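
For this exponential family the empirical risk is separable across
coordinates, so its minimizer over $[a, b]^m$ can be written in closed form:
each coordinate is the reciprocal of the corresponding sample mean, clipped to
$[a, b]$. The sketch below illustrates this and checks $|Q(x,g)| \le M(x)$ on
simulated data. The sample size, seed, parameter values, and function names
are assumptions made for illustration only.

```python
import math
import random

def loss(x, g):
    """Negative log-likelihood Q(x, g) = -ln k(g) + <g, x>, k(g) = prod g_i."""
    return -sum(math.log(gi) for gi in g) + sum(gi * xi for gi, xi in zip(g, x))

def dominating(x, a, b):
    """M(x) = m(|ln a| + |ln b|) + sqrt(m) * b * ||x||."""
    m = len(x)
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    return m * (abs(math.log(a)) + abs(math.log(b))) + math.sqrt(m) * b * norm_x

def erm_rates(samples, a, b):
    """Empirical risk minimizer over [a, b]^m: clipped reciprocal sample means."""
    n = len(samples)
    m = len(samples[0])
    means = [sum(x[i] for x in samples) / n for i in range(m)]
    return [min(max(1.0 / mean, a), b) for mean in means]

# Hypothetical model: independent exponential coordinates with rates h_true.
rng = random.Random(1)
a, b = 0.5, 4.0
h_true = [1.0, 2.5]
samples = [[rng.expovariate(h) for h in h_true] for _ in range(20000)]

g_hat = erm_rates(samples, a, b)
# Pointwise check that |Q(x, g_hat)| never exceeds M(x) on the sample.
dominated = all(abs(loss(x, g_hat)) <= dominating(x, a, b) for x in samples)
```

Clipping is exact here because each coordinate of the empirical risk,
$-\ln g_i + g_i \bar{x}_i$, is convex in $g_i$, so its minimizer over an
interval is the projection of the unconstrained minimizer $1/\bar{x}_i$.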



5 Conclusion 

The problem of decision-making with side information using unbounded loss 
functions was considered. A decision-making model was used as a generic 
system-theoretic representation for a broader range of applications in machine 
learning, signal processing, and communications. Sufficient conditions for con- 
sistency of the ERM decision policy with exponential convergence behavior were 
derived. These conditions are also sufficient for strong consistency of the ERM 
decision policy with unbounded loss functions. It was shown that, in addition
to a condition on the growth rate of the class of loss functions, it suffices to have a
dominating function whose exponential Orlicz expectation is uniformly bounded 
over the probabilistic model. The results verify that the decay constant is simi- 
lar to the decay constant known for the case of bounded loss functions, but the 
decay exponent is strictly smaller than one. 